Evaluating the Accuracy of Artificial Intelligence (AI)-Generated Synopses for Plasma Cell Disorder Treatment Regimens

Mohsin, Aleenah; Rubinstein, Samuel M.; Banerjee, Rahul; Mishra, Sanjay; Kwok, Mary; Yang, Peter; Warner, Jeremy Lyle; Cowan, Andrew J.

doi:10.1182/blood-2024-208513

BACKGROUND

Concise, accurate, and real-time synopses of the evidence base are critical to support treatment decision-making in hematology, especially in a rapidly evolving space such as the treatment of plasma cell disorders (PCDs). Synopses are used in clinical practice guidelines, educational settings, and general clinical practice. Curating this knowledge is time-consuming and cumbersome, even when performed by a clinical expert. Artificial intelligence (AI), specifically large language models (LLMs), holds promise in this context; however, LLMs are prone to hallucination and may provide inaccurate and out-of-date information. Moreover, how well individual LLMs summarize the clinical evidence base may vary widely. We objectively assessed the abilities of four LLMs to generate accurate, coherent, and relevant synopses for six widely used PCD regimens.

METHODS

We compared the performance of four popular LLMs: 1) Claude 3.5 (“Claude”); 2) ChatGPT 4.0 (“ChatGPT”); 3) Gemini; and 4) Llama-3.1 (“Llama”). Each LLM was prompted exactly as follows: “write a synopsis for the development and evolution of therapy with [regimen name] for PCD, using citations from the literature”, where [regimen name] was replaced with: 1) “Dara-RVd”; 2) “KRd”; 3) “VDT-PACE”; 4) “Dara-CyBorD”; 5) “Elranatamab”; and 6) “Talquetamab.” Two PCD physician specialists assessed the generated synopses using Likert scales on six criteria: accuracy, completeness, relevance, clarity, hallucinations, and coherence; lower scores correspond to poor performance and higher scores to excellent performance. To minimize bias, the reviewers were blinded to which LLM generated each synopsis and conducted their evaluations independently. Evaluations were recorded in REDCap. Mean scores with 95% confidence intervals (CIs) were calculated for each LLM in every domain.
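The protocol above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the prompt wording and regimen names are taken from the text, but the abstract does not state how the confidence intervals were computed, so the normal-approximation formula (z = 1.96) is an assumption, and the function name `mean_with_ci` and the ratings in `example_ratings` are hypothetical.

```python
import math
import statistics

# Prompt wording and regimen names as given in METHODS.
PROMPT_TEMPLATE = (
    "write a synopsis for the development and evolution of therapy with "
    "{regimen} for PCD, using citations from the literature"
)
REGIMENS = ["Dara-RVd", "KRd", "VDT-PACE", "Dara-CyBorD", "Elranatamab", "Talquetamab"]

# One prompt per regimen; each would be sent unchanged to every LLM.
prompts = [PROMPT_TEMPLATE.format(regimen=name) for name in REGIMENS]


def mean_with_ci(ratings, z=1.96):
    """Return (mean, lower, upper) for a list of Likert ratings.

    Assumption: a normal-approximation 95% CI, mean +/- z * SD / sqrt(n),
    using the sample standard deviation. The abstract does not specify
    the method actually used.
    """
    n = len(ratings)
    mean = statistics.mean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(n)  # standard error of the mean
    half_width = z * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical ratings for one LLM on one criterion:
# 2 reviewers x 6 regimens = 12 scores. These are NOT study data.
example_ratings = [4, 4, 3, 5, 4, 4, 3, 4, 5, 4, 4, 4]
mean, lower, upper = mean_with_ci(example_ratings)
print(f"{mean:.2f} ({lower:.2f}-{upper:.2f})")
```

Under this formula the interval collapses to zero width only when every rating is identical, because the sample standard deviation is then zero.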

RESULTS

There were marked differences in LLM performance across the six criteria. Claude demonstrated the highest performance in all domains, notably outperforming the other LLMs in accuracy (mean Likert score: Claude 3.92 [95% CI 3.54-4.29]; ChatGPT 3.25 [2.76-3.74]; Gemini 3.17 [2.54-3.80]; Llama 1.92 [1.41-2.43]), completeness (Claude 4.00 [3.66-4.34]; ChatGPT 2.58 [2.02-3.15]; Gemini 2.58 [2.02-3.15]; Llama 1.67 [1.39-1.95]), and lack of hallucinations (Claude 4.00 [4.00-4.00]; ChatGPT 2.75 [2.06-3.44]; Gemini 3.25 [2.65-3.85]; Llama 1.92 [1.26-2.57]). Llama performed considerably worse than the other LLMs across all studied domains, frequently providing inaccurate information and misinterpreted results. ChatGPT and Gemini had intermediate performance. Notably, none of the LLMs, including Claude, registered perfect accuracy, completeness, or relevance.
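For readers who want to work with the reported figures, the accuracy results can be held in a small table and compared pairwise. The sketch below uses only the means and intervals quoted above; the helper `intervals_overlap` is a hypothetical name. Checking whether two confidence intervals overlap is a rough descriptive aid only, not a significance test, and the abstract itself reports no formal between-model comparison.

```python
from itertools import combinations

# Accuracy results as reported in RESULTS: (mean, CI lower, CI upper).
accuracy = {
    "Claude": (3.92, 3.54, 4.29),
    "ChatGPT": (3.25, 2.76, 3.74),
    "Gemini": (3.17, 2.54, 3.80),
    "Llama": (1.92, 1.41, 2.43),
}


def intervals_overlap(a, b):
    """True if the CI of result a intersects the CI of result b."""
    return a[1] <= b[2] and b[1] <= a[2]


# Models ordered from highest to lowest mean accuracy.
ranking = sorted(accuracy, key=lambda model: accuracy[model][0], reverse=True)

# Every pair of models whose reported accuracy intervals intersect.
overlapping = [
    (first, second)
    for first, second in combinations(accuracy, 2)
    if intervals_overlap(accuracy[first], accuracy[second])
]

print("Ranking by mean accuracy:", ranking)
print("Pairs with overlapping 95% CIs:", overlapping)
```

The same dictionary layout works for the completeness and hallucination scores by substituting the corresponding reported values.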

CONCLUSION

To our knowledge, this is the first rigorous evaluation of widely available LLMs for the task of evidence summarization in the PCD domain. In our evaluation of four LLMs for six PCD regimens, meaningful differences were evident across the LLMs. While Claude performed at a notably higher level, evidence synopses need to be as close as possible to perfectly accurate and comprehensive to be usable for real-world clinical decision support. By this standard, even the best-performing LLMs still require careful editing by a domain expert to become usable. Inaccurate and incoherent synopses could lead to suboptimal patient care if taken at face value. We plan to repeat this experiment with newer generations of LLMs to evaluate their potential improvement over time.

Disclosures

Rubinstein:Sanofi: Consultancy; Johnson & Johnson: Consultancy; LAVA Therapeutics: Consultancy; Bristol Myers Squibb: Membership on an entity's Board of Directors or advisory committees, Other: Abemca. Banerjee:Adaptive; BMS; Caribou Biosciences; Genentech; GSK; JNJ / Janssen; Karyopharm; Legend Biotech; Pfizer; Sanofi; SparkCures: Consultancy; Abbvie; JNJ; Novartis; Pack Health; Prothena; Sanofi: Research Funding. Warner:NIH: Other: grants; Westat, The Lewin Group: Consultancy; HemOnc.org LLC: Other: ownership; Brown Physicians Incorporated: Other: grants. Cowan:Sanofi: Consultancy, Research Funding; Janssen: Consultancy, Research Funding; Juno/Celgene: Research Funding; Harpoon: Research Funding; Nektar: Research Funding; Regeneron: Research Funding; IgM Biosciences: Research Funding; Adaptive Biotechnologies: Consultancy, Research Funding; Caelum: Research Funding; Sebia: Consultancy; BMS: Consultancy, Research Funding; HopeAI: Consultancy, Current holder of stock options in a privately-held company; Abbvie: Consultancy, Research Funding.

2024

ASH Publications

American Society of Hematology
